My model so far has achieved a validation adjusted R^2 of 0.8327 and a validation R^2 of 0.8596.
The original (training) R^2 was 0.9134, with an adjusted R^2 of 0.9066.
library(readr)
library(ggplot2)
library(pander)
library(tidyverse)
library(plotly)
library(reshape2)
train <- read.csv("../Data/train.csv", stringsAsFactors = TRUE)
train <- train %>%
  mutate(TotalSF = X1stFlrSF + X2ndFlrSF + TotalBsmtSF,
         RichNbrhd = case_when(Neighborhood %in% c("StoneBr", "NridgHt", "NoRidge") ~ 1, TRUE ~ 0),
         Alley = replace_na(as.character(Alley), "None"),
         Alley = as.factor(Alley))
set.seed(121)
num_rows <- 1000
keep <- sample(1:nrow(train), num_rows)
mytrain <- train[keep, ]
mytest <- train[-keep, ]
lm_model <- lm(SalePrice ~ TotalSF + RichNbrhd + YearBuilt + WoodDeckSF + FullBath + BsmtQual + Neighborhood + HouseStyle + OverallQual + OverallCond + BsmtCond + TotalBsmtSF +
TotalSF:RichNbrhd + TotalSF:Fireplaces + TotalSF:Neighborhood + TotalSF:OverallCond + TotalSF:TotalBsmtSF, data=mytrain)
pander(summary(lm_model))
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -1111552 | 176639 | -6.293 | 4.868e-10 |
| TotalSF | 80.41 | 39.38 | 2.042 | 0.04147 |
| RichNbrhd | -121784 | 113843 | -1.07 | 0.285 |
| YearBuilt | 512.8 | 69.78 | 7.348 | 4.527e-13 |
| WoodDeckSF | 16.41 | 6.951 | 2.36 | 0.01847 |
| FullBath | 1206 | 2543 | 0.474 | 0.6356 |
| BsmtQualFa | -33407 | 7338 | -4.553 | 6.03e-06 |
| BsmtQualGd | -36587 | 3914 | -9.347 | 6.933e-20 |
| BsmtQualTA | -35416 | 4833 | -7.328 | 5.194e-13 |
| NeighborhoodBlueste | 21566 | 48495 | 0.4447 | 0.6566 |
| NeighborhoodBrDale | 86632 | 120906 | 0.7165 | 0.4739 |
| NeighborhoodBrkSide | 119843 | 111929 | 1.071 | 0.2846 |
| NeighborhoodClearCr | 63954 | 118041 | 0.5418 | 0.5881 |
| NeighborhoodCollgCr | 86155 | 110955 | 0.7765 | 0.4377 |
| NeighborhoodCrawfor | 136357 | 111803 | 1.22 | 0.2229 |
| NeighborhoodEdwards | 167508 | 110824 | 1.511 | 0.131 |
| NeighborhoodGilbert | 52224 | 112179 | 0.4655 | 0.6417 |
| NeighborhoodIDOTRR | 94784 | 115410 | 0.8213 | 0.4117 |
| NeighborhoodMeadowV | 134033 | 112500 | 1.191 | 0.2338 |
| NeighborhoodMitchel | 117917 | 111938 | 1.053 | 0.2924 |
| NeighborhoodNAmes | 136988 | 110718 | 1.237 | 0.2163 |
| NeighborhoodNoRidge | 56268 | 35251 | 1.596 | 0.1108 |
| NeighborhoodNPkVill | 166470 | 172379 | 0.9657 | 0.3344 |
| NeighborhoodNridgHt | 20768 | 34308 | 0.6053 | 0.5451 |
| NeighborhoodNWAmes | 99850 | 111689 | 0.894 | 0.3716 |
| NeighborhoodOldTown | 131950 | 110993 | 1.189 | 0.2348 |
| NeighborhoodSawyer | 140505 | 111575 | 1.259 | 0.2083 |
| NeighborhoodSawyerW | 101359 | 111589 | 0.9083 | 0.3639 |
| NeighborhoodSomerst | 70926 | 111416 | 0.6366 | 0.5246 |
| NeighborhoodSWISU | 160927 | 112443 | 1.431 | 0.1527 |
| NeighborhoodTimber | 36912 | 112948 | 0.3268 | 0.7439 |
| NeighborhoodVeenker | -115610 | 126479 | -0.9141 | 0.3609 |
| HouseStyle1.5Unf | 3455 | 8500 | 0.4064 | 0.6845 |
| HouseStyle1Story | 9225 | 3895 | 2.369 | 0.01807 |
| HouseStyle2.5Fin | -582 | 13263 | -0.04388 | 0.965 |
| HouseStyle2.5Unf | -121.3 | 11037 | -0.01099 | 0.9912 |
| HouseStyle2Story | -219.6 | 3566 | -0.06159 | 0.9509 |
| HouseStyleSFoyer | 12212 | 6876 | 1.776 | 0.07609 |
| HouseStyleSLvl | 15405 | 5562 | 2.77 | 0.005724 |
| OverallQual | 10408 | 1152 | 9.033 | 9.987e-19 |
| OverallCond | -1166 | 2703 | -0.4313 | 0.6663 |
| BsmtCondGd | 2819 | 6253 | 0.4508 | 0.6522 |
| BsmtCondPo | 3826 | 19550 | 0.1957 | 0.8449 |
| BsmtCondTA | 3185 | 5051 | 0.6306 | 0.5284 |
| TotalBsmtSF | -15.57 | 10.31 | -1.511 | 0.1311 |
| TotalSF:RichNbrhd | 48.05 | 39.58 | 1.214 | 0.2251 |
| TotalSF:Fireplaces | 3.728 | 0.5821 | 6.403 | 2.446e-10 |
| TotalSF:NeighborhoodBrDale | -34.98 | 47.49 | -0.7365 | 0.4616 |
| TotalSF:NeighborhoodBrkSide | -37.8 | 39.79 | -0.95 | 0.3424 |
| TotalSF:NeighborhoodClearCr | -15.75 | 41.12 | -0.383 | 0.7018 |
| TotalSF:NeighborhoodCollgCr | -25.43 | 38.95 | -0.653 | 0.5139 |
| TotalSF:NeighborhoodCrawfor | -39.72 | 39.18 | -1.014 | 0.311 |
| TotalSF:NeighborhoodEdwards | -64.66 | 38.92 | -1.661 | 0.09699 |
| TotalSF:NeighborhoodGilbert | -14.13 | 39.51 | -0.3577 | 0.7206 |
| TotalSF:NeighborhoodIDOTRR | -29.52 | 42.58 | -0.6934 | 0.4882 |
| TotalSF:NeighborhoodMeadowV | -63.43 | 40.57 | -1.563 | 0.1183 |
| TotalSF:NeighborhoodMitchel | -45.37 | 39.4 | -1.152 | 0.2498 |
| TotalSF:NeighborhoodNAmes | -50.89 | 38.86 | -1.309 | 0.1907 |
| TotalSF:NeighborhoodNoRidge | -25.85 | 9.747 | -2.652 | 0.008144 |
| TotalSF:NeighborhoodNPkVill | -69.39 | 71.88 | -0.9654 | 0.3346 |
| TotalSF:NeighborhoodNridgHt | -10.28 | 9.931 | -1.035 | 0.3011 |
| TotalSF:NeighborhoodNWAmes | -37.86 | 39.17 | -0.9668 | 0.3339 |
| TotalSF:NeighborhoodOldTown | -49.08 | 39.04 | -1.257 | 0.209 |
| TotalSF:NeighborhoodSawyer | -53.62 | 39.38 | -1.362 | 0.1736 |
| TotalSF:NeighborhoodSawyerW | -32.78 | 39.16 | -0.837 | 0.4028 |
| TotalSF:NeighborhoodSomerst | -15.67 | 39.12 | -0.4005 | 0.6889 |
| TotalSF:NeighborhoodSWISU | -58.88 | 39.67 | -1.484 | 0.138 |
| TotalSF:NeighborhoodTimber | -9.162 | 39.46 | -0.2322 | 0.8164 |
| TotalSF:NeighborhoodVeenker | 51.96 | 44.09 | 1.178 | 0.239 |
| TotalSF:OverallCond | 3.674 | 1.09 | 3.37 | 0.0007826 |
| TotalSF:TotalBsmtSF | -0.008073 | 0.002233 | -3.615 | 0.0003171 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 969 | 24841 | 0.9134 | 0.9066 |
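# Predict on the held-out rows; rows with missing predictor values yield NA.
predicted <- predict(lm_model, newdata=mytest)
# Fill NA predictions with the mean prediction so every held-out row is scored.
predicted <- ifelse(is.na(predicted), mean(predicted, na.rm = TRUE), predicted)
ybar <- mean(mytest$SalePrice)
SSTO <- sum((mytest$SalePrice - ybar)^2)       # total sum of squares
SSE <- sum((mytest$SalePrice - predicted)^2)   # error sum of squares
r_squared <- 1 - SSE / SSTO
n <- nrow(mytest)
# Note: coef() also counts the intercept and any NA (aliased) terms, so p
# slightly overstates the predictor count and the adjustment is conservative.
p <- length(coef(lm_model))
adj_r_squared <- 1 - ((n - 1) / (n - p -1)) * (SSE / SSTO)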
validation_results <- data.frame(
  Model = "My Model",
  `Original R^2` = summary(lm_model)$r.squared,
  `Original Adj. R^2` = summary(lm_model)$adj.r.squared,
  `Validation R^2` = r_squared,
  `Validation Adj. R^2` = adj_r_squared,
  check.names = FALSE  # keep the readable column names instead of Original.R.2, etc.
)
knitr::kable(validation_results, digits = 4)
| Model | Original R^2 | Original Adj. R^2 | Validation R^2 | Validation Adj. R^2 |
|---|---|---|---|---|
| My Model | 0.9134 | 0.9066 | 0.8596 | 0.8327 |
ggplot(mytrain, aes(x = TotalSF, y = SalePrice)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "red") +
labs(title = "SalePrice vs TotalSF", x = "Total Square Feet", y = "Sale Price") +
theme_minimal()
This scatter plot, titled “SalePrice vs TotalSF”, shows the relationship between a house’s total square footage and its sale price. Each grey dot corresponds to one house. There’s a clear upward trend, indicating a positive correlation between the two variables: as square footage increases, sale price generally increases as well. The red line is a simple linear regression fit (geom_smooth with method = "lm"). While the trend is evident, the scatter around the line shows that factors other than square footage also influence sale price.
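The visual trend can also be put into numbers. The following is a minimal sketch on synthetic data (the `toy` data frame and its coefficients are made up for illustration and are not the housing data); it computes the correlation and the slope of the same kind of line that `geom_smooth(method = "lm")` draws.

```r
# Synthetic stand-in data (hypothetical; not the housing training set).
set.seed(1)
toy <- data.frame(TotalSF = runif(200, 800, 4000))
toy$SalePrice <- 20000 + 60 * toy$TotalSF + rnorm(200, sd = 30000)

# Pearson correlation: direction and strength of the linear trend.
r <- cor(toy$TotalSF, toy$SalePrice)

# Simple linear regression of price on square footage.
fit <- lm(SalePrice ~ TotalSF, data = toy)
slope <- unname(coef(fit)["TotalSF"])

# With a single predictor, R^2 equals the squared correlation.
c(r = r, slope = slope, r_squared = summary(fit)$r.squared)
```

Swapping `toy` for `mytrain` should give the corresponding numbers for the plot above, assuming neither column has missing values.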
library(plotly)
plot_ly(mytrain, x = ~TotalSF, y = ~OverallQual, z = ~SalePrice, type = "scatter3d", mode = "markers",
marker = list(size = 3, color = ~SalePrice, colorscale = "Viridis")) %>%
layout(title = "3D Scatter: SalePrice vs TotalSF & OverallQual",
scene = list(xaxis = list(title = "TotalSF"),
yaxis = list(title = "OverallQual"),
zaxis = list(title = "SalePrice")))
This 3D scatter plot illustrates the relationship between Sale Price, Total Square Feet (TotalSF), and Overall Quality (OverallQual) of houses, revealing a strong positive correlation across all three variables. As both TotalSF and OverallQual increase, SalePrice tends to rise, indicating that larger, higher-quality homes command higher prices. The data points cluster in the lower ranges, but distinct outliers, especially those with high SalePrice and OverallQual, suggest premium properties.
library(plotly)
plot_ly(mytrain, x = ~RichNbrhd, y = ~YearBuilt, z = ~SalePrice, type = "scatter3d", mode = "markers",
marker = list(size = 3, color = ~SalePrice, colorscale = "Viridis")) %>%
layout(title = "3D Scatter: SalePrice vs RichNbrhd & YearBuilt",
scene = list(xaxis = list(title = "RichNbrhd"),
yaxis = list(title = "YearBuilt"),
zaxis = list(title = "SalePrice")))
This 3D scatter plot illustrates the relationship between SalePrice, YearBuilt, and RichNbrhd. It suggests a potential trend where newer houses (higher YearBuilt) in the rich neighborhoods (RichNbrhd = 1) tend to have higher SalePrices. Because RichNbrhd is a 0/1 indicator, the points fall into two planes, and most of them sit at RichNbrhd = 0, indicating that most properties are not in the “rich” neighborhoods. However, there is a noticeable spread of SalePrices across different YearBuilt values, with some newer homes showing significantly higher prices.
library(plotly)
plot_ly(mytrain, x = ~OverallCond, y = ~FullBath, z = ~SalePrice, type = "scatter3d", mode = "markers",
marker = list(size = 3, color = ~SalePrice, colorscale = "Viridis")) %>%
layout(title = "3D Scatter: SalePrice vs OverallCond & FullBath",
scene = list(xaxis = list(title = "OverallCond"),
yaxis = list(title = "FullBath"),
zaxis = list(title = "SalePrice")))
This 3D scatter plot visualizes SalePrice against OverallCond and FullBath. It suggests a possible positive relationship between SalePrice and OverallCond: as OverallCond increases, SalePrice tends to rise, indicating that homes in better condition command higher prices. Data points cluster in the lower OverallCond ranges, with outliers at high SalePrice and OverallCond suggesting premium properties.
par(mfrow=c(2,3))
plot(lm_model, which = c(1,2,4,5, 6))
plot(lm_model$residuals)
Residuals vs Fitted: The dots should be spread out randomly. If they’re not, it means our predictions might be off, especially for certain values. Some points, like 8080 and 1325, are far away.
Q-Q Residuals: This checks if the errors are normal. If the dots follow the line, it’s good. Some dots are a bit off, especially 8080 and 1325.
Cook’s Distance: This shows if any single point has a big influence. Point 524 has a very big influence.
Residuals vs Leverage: This finds points that are both far away from the others and have big errors. Points 524, 497, and 1183 are like this.
Cook’s dist vs Leverage: Another way to see those influential points. Point 524 is still the biggest problem.
Residuals over Index: This checks if the errors are random over the data. They look random, but again, 8080 and 1325 are way off.
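The points called out in these plots can also be listed directly instead of being read off the figures. This is a minimal sketch on the built-in `mtcars` data as a stand-in model (not the housing model), and the `4 / n` cutoff for Cook's distance is a common rule of thumb that I am assuming here, not a formal test.

```r
# Stand-in model on built-in data (not the housing model above).
fit <- lm(mpg ~ wt + hp, data = mtcars)

infl <- data.frame(
  obs       = names(residuals(fit)),
  std_resid = rstandard(fit),       # standardized residuals
  leverage  = hatvalues(fit),       # diagonal of the hat matrix
  cooks_d   = cooks.distance(fit)   # combines residual size and leverage
)

# Rule-of-thumb cutoff (an assumption): flag Cook's distance above 4 / n.
cutoff  <- 4 / nrow(infl)
flagged <- infl[infl$cooks_d > cutoff, ]
flagged[order(-flagged$cooks_d), ]
```

Running the same calls on `lm_model` would give a table of the flagged row labels to compare against the ones named above.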
Original Adjusted R² (0.9066): Indicates the model explains about 90.66% of the variance in training-set house prices after penalizing for the number of predictors.
Validation Adjusted R² (0.8327): Shows how well the model generalizes to new data. A decrease from the original suggests slight overfitting, but 0.8327 still indicates strong model performance on unseen data.
Validation R² (0.8596) measures the proportion of variance in the validation data explained by the model. A high value indicates good predictive performance, though it doesn’t adjust for the number of predictors, so it’s best considered alongside adjusted R².
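The two validation measures can be wrapped in a small function. This is a sketch under stated assumptions: `validation_r2()` is a hypothetical helper name, the model is assumed to include an intercept, and the example runs on the built-in `mtcars` data as a stand-in for `mytrain` and `mytest`. Unlike the code earlier in this report, it drops rows that cannot be predicted instead of imputing them, and it counts only the estimated slopes, so it would not reproduce the reported numbers exactly.

```r
# Out-of-sample R^2 and adjusted R^2 for a fitted lm on held-out data.
# validation_r2() is a hypothetical helper, not part of any package.
validation_r2 <- function(model, newdata, response) {
  y    <- newdata[[response]]
  yhat <- predict(model, newdata = newdata)
  ok   <- !is.na(y) & !is.na(yhat)   # score only rows that could be predicted
  y    <- y[ok]
  yhat <- yhat[ok]

  sse  <- sum((y - yhat)^2)          # error sum of squares
  ssto <- sum((y - mean(y))^2)       # total sum of squares
  n    <- length(y)
  # Estimated slopes only: drop NA (aliased) terms and the intercept.
  p    <- sum(!is.na(coef(model))) - 1

  r2 <- 1 - sse / ssto
  c(r2 = r2, adj_r2 = 1 - (n - 1) / (n - p - 1) * (1 - r2))
}

# Usage on a stand-in train/validation split.
set.seed(121)
keep <- sample(nrow(mtcars), 22)
fit  <- lm(mpg ~ wt + hp, data = mtcars[keep, ])
res  <- validation_r2(fit, mtcars[-keep, ], "mpg")
res
```

Because the penalty factor (n - 1) / (n - p - 1) is greater than 1 whenever p > 0, the adjusted value can never exceed the unadjusted one, which matches the ordering of 0.8327 and 0.8596 above.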
TotalSF: Indicates how much SalePrice changes per additional square foot of total space. Because TotalSF also appears in several interaction terms, this main effect is the slope only when the interacting variables are at zero or at their reference level.
RichNbrhd: Shows the price difference for houses in wealthy neighborhoods (StoneBr, NridgHt, NoRidge) compared to others.
YearBuilt: Reflects how much the sale price increases for each year later the house was built (about $513 per year in this fit), holding the other predictors fixed.
OverallQual: Rates the overall quality of a home, and its coefficient is the price change per one-point increase in that rating. The related OverallCond rates the condition of a home.
Interaction terms: For example, TotalSF:RichNbrhd shows how the effect of square footage on price changes in wealthy neighborhoods.
There are other coefficients in the model, but I felt these were the most impactful and led to me achieving the highest R-squared value.
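To make the interaction reading concrete: in the summary table, TotalSF:OverallCond is 3.674, so each extra point of OverallCond adds about $3.67 to the per-square-foot slope. The sketch below shows the same mechanics on the built-in `mtcars` data (a stand-in, not the housing model), where a 0/1 indicator plays the role that RichNbrhd plays here.

```r
# Stand-in example: does the slope of wt differ between am = 0 and am = 1?
fit <- lm(mpg ~ wt * am, data = mtcars)   # expands to wt + am + wt:am
b   <- coef(fit)

slope_am0 <- unname(b["wt"])               # slope of wt when am = 0
slope_am1 <- unname(b["wt"] + b["wt:am"])  # slope of wt when am = 1

c(am0 = slope_am0, am1 = slope_am1)
```

The main effect alone is the slope for the reference group, and the interaction coefficient is the difference between the two groups' slopes.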